ABSTRACT



A technique for dynamically exploiting affinity, locality, and 
load balancing in scheduling the execution of multi-threaded 
user programs in a multi-processor computer system. 
Affinity, locality, and load balancing characteristics are 
specified at thread creation time and can be dynamically 
changed during thread execution, either by the user program 
itself or by any other process or entity with sufficient 
privileges and information. A central schedule queue and 
one or more per-processor local schedule queues are used to 
schedule the threads based on the dynamically changing 
affinity, locality, and load balancing characteristics. 
AFFINITY, LOCALITY, AND LOAD BALANCING IN SCHEDULING USER PROGRAM-LEVEL THREADS FOR EXECUTION BY A COMPUTER SYSTEM

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates in general to scheduling on computer systems, and more particularly, to the use of affinity, locality, and load balancing in scheduling user program-level threads for execution by a computer system.

2. Description of Related Art

Multiple processor computer systems are a well known technique for increasing the performance of computer programs. In such systems, computer programs can be executed in parallel by utilizing each processor simultaneously.

In addition, operating systems often provide facilities for multi-threaded programming to enhance parallelism. In multi-threaded programming, the execution of a computer program is divided into multiple threads, wherein a thread is a stream of instructions executed by the computer on behalf of the computer program. Typically, each thread is allocated to a different processor, so that each of these threads is then executed in parallel at their respective separate processors, although multi-threaded programming can also enhance parallelism on uni-processor computer systems as well.

Modern operating systems typically provide facilities for multi-threaded programming at two levels: kernel-level and user program-level. See, e.g., Steve Kleiman, Devang Shah, and Bart Smaalders, Programming with Threads, Sunsoft Press, Mountain View, Calif. 1996; and Andrew Tanenbaum, Modern Operating Systems, Prentice-Hall, Englewood Cliffs, N.J., 1992. Kernel-level threads are scheduled by the operating system. In addition, a kernel-level thread runs within a process and can be referenced by other kernel-level threads.

User program-level threads run on top of kernel-level threads, can be scheduled in the user program address space, and have no kernel-level data structures. Because of this, user program-level threads generally have lower context-switch time and scheduling time as compared to kernel-level threads.

One way of differentiating kernel-level and user program-level threads is that kernel-level threads depict multi-processing resources within a system, whereas user program-level threads model parallelism within a user program. Generally, the user program has no control over kernel-level threads, unless the user program comprises kernel extensions or device drivers.

With the increasing interest in user program-level multi-threaded programming, a number of user program-level thread libraries have been implemented. Typical implementations of a user program-level thread library provide facilities for creating and destroying threads, for waiting on a thread to terminate, for waiting on a thread to yield itself, and for blocking and unblocking a thread. In addition, locking facilities for accessing data shared between the threads in a safe manner without race conditions are often provided. Mechanisms for thread-specific data, thread priorities, and thread-specific signal handling also may be provided.

The most significant user program-level library is the "pthreads" library proposed by the POSIX standards committee. See, e.g., Institute of Electrical and Electronic Engineers, Inc., Information Technology - Portable Operating Systems Interface (POSIX) - Part 1: System User program Interface (API) - Amendment 2: Threads Extension [C Language], IEEE, New York, N.Y., IEEE Standard 1003.1c-1995 edition, 1995. See also ISO/IEC 9945-1:1990c. Pthreads implementations are available on most UNIX systems today.

Most of the early work on thread scheduling concentrates on load balancing, where threads are placed in a FIFO-based central ready queue. Examples include the Presto system, Brown Threads system, and loop scheduling systems. In these systems, processors take threads from this central ready queue and run them to completion. The load is evenly balanced, but this technique does not take advantage of locality and significant cache misses can occur on specific processors. Also, such schemes scale poorly.

Anderson et al. have proposed a scheme with per-processor ready queues. See, e.g., Thomas Anderson, Brian Bershad, Edward Lazowska, and Henry Levy, Thread Management for Shared-Memory Multi-processors, Technical Report, Department of Computer Science and Engineering, University of Washington, 1991; and Thomas Anderson, FastThreads User's Manual, Department of Computer Science and Engineering, University of Washington, Seattle, 1990. This improves scalability by reducing contention. It also preserves processor affinity to some extent. Under this scheme, a thread may execute on the processor on which it was created. However, a processor can steal a thread from the queue of another processor. These per-processor local queues use shared locks to permit thread stealing and so incur high context switch time.

Markatos and LeBlanc did an experimental study of scheduling strategies on the SGI™ IRIS (UMA, Uniform Memory Access) and BBN™ Butterfly shared memory (NUMA, Non-Uniform Memory Access) computer systems, wherein the experiments involved combinations of thread assignment policies with thread reassignment policies. See, e.g., Evangelos Markatos and Thomas LeBlanc, Load Balancing vs. Locality Management in Shared-Memory Multi-processors, Proceedings of the International Conference on Parallel Processing, pages 258-267, August 1992. Two kinds of thread assignment policies were studied: (1) load balancing (LB), where a thread is assigned to a processor with the shortest queue, and (2) memory-conscious scheduling (MCS), where a thread is assigned to a processor whose local memory contains most of the data accessed by a thread.

These were combined with three rescheduling policies to keep the processors as busy as possible: (1) Aggressive Migration (AM), where an idle processor steals a thread from a processor with the longest queue; (2) No Migration (NM), which prefers locality to migration; and (3) Beneficial Migration (BM), where an idle processor searches the queues of other processors for a thread whose migration will lower the execution time. Note that BM is an unrealizable policy as it requires complete information about the execution times and data access patterns of the threads.

The authors conclude that central queues are inadequate even on small systems. Per-processor queues by themselves are not enough and should be combined with thread reassignment strategies. The authors recognize that locality management is an important issue as processor speeds continue to increase at a rate faster than that of memories or interconnect networks.

In Torrellas, Tucker, and Gupta, the authors study cache-affinity based scheduling policies. See, e.g., Joseph Torrellas, Andrew Tucker, and Anoop Gupta, Evaluating the
Performance of Cache-Affinity Scheduling in Shared Memory Multi-processors, Journal of Parallel and Distributed Computing, 22(2):139-151, February 1995. This publication explores affinity scheduling to reduce cache misses by preferentially scheduling a process on a processor where it ran most recently. The implementation adds affinity to an existing system by raising the priorities of processes that are attractive from the standpoint of affinity scheduling when searching the ready queue.

Steckermeier and Bellosa use locality information in user program-level scheduling for cache optimization in a hierarchical shared memory (NUMA) machine, like the Convex Exemplar. See, e.g., Martin Steckermeier and Frank Bellosa, Using Locality Information in User Level Scheduling, Technical Report TR-95-14, University of Erlangen-Nürnberg, Computer Science Department, Operating Systems-IMMD IV, Martensstraße, 91058 Erlangen, Germany, December 1995. A thread is scheduled on a processor in whose local memory the thread has most of its data. Also, two different threads which access the same data set are scheduled on the same processor.

The COOL system provides facilities to provide affinity hints with tasks. See, e.g., Rohit Chandra, Anoop Gupta, and John Hennessy, COOL: An Object-Based Language for Parallel Programming, Computer, pages 13-26, August 1994. COOL is a parallel extension to C++ for shared-memory parallelism that provides a variety of facilities for locality and affinity, wherein functions marked as "parallel" execute as separate tasks and each processor has its own task queues. In COOL, tasks can be co-located to exploit cache affinity. Similarly, they can be declared to be affine to a processor to exploit processor affinity. Tasks operating on the same data can also be declared to execute back-to-back on the same processor. However, COOL affinity specifications are used only at task creation time and there is no way to change the specification as tasks are running. Moreover, COOL tasks do not have thread capabilities.

Additional information on the prior art can be found in the inventor's own thesis. See, e.g., Neelakantan Sundaresan, Modeling Control and Dynamic Data Parallelism in Object-Oriented Languages, Ph.D thesis, Indiana University, Bloomington, September 1995.

Although these publications evidence the research undertaken in recent years, there is a need in the art for more sophisticated techniques for scheduling multi-threaded user programs, especially as multi-processor computer systems become more common. Indeed, there is a need in the art for scheduling techniques that fully exploit the sometimes competing interests of affinity, locality, and load balancing. Further, there is a need in the art for techniques that permit such characteristics to be defined and modified dynamically.

SUMMARY OF THE INVENTION

To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for dynamically exploiting affinity, locality, and load balancing in scheduling the execution of multi-threaded user programs in a multi-processor computer system. Affinity, locality, and load balancing characteristics are specified at thread creation time and can be dynamically changed during thread execution, either by the user program itself or by any other process or entity with sufficient privileges and information. A central schedule queue and one or more per-processor local schedule queues are used to schedule the threads based on the dynamically changing affinity, locality, and load balancing characteristics.

These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and form a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying detailed description, in which there are illustrated and described specific examples of a method, apparatus, and article of manufacture in accordance with the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Referring now to the drawings in which like reference numbers represent corresponding parts throughout:

FIG. 1 is a block diagram that illustrates an exemplary hardware environment according to the preferred embodiment of the present invention;

FIG. 2 is a block diagram that further illustrates the exemplary software environment according to the preferred embodiment of the present invention;

FIG. 3 is a block diagram that illustrates an exemplary central schedule queue and per-processor local schedule queues according to the preferred embodiment of the present invention;

FIG. 4 is a block diagram that illustrates the grouping of threads by affinity-ids in a schedule queue according to the preferred embodiment of the present invention;

FIG. 5 is a state diagram that illustrates the different states of a thread in the computer system according to the preferred embodiment of the present invention; and

FIG. 6 is a flowchart that illustrates exemplary logic performed by the scheduler during the ready state illustrated in FIG. 5 according to the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

Overview

The present invention discloses a technique for dynamically changing affinity, locality, and load balancing characteristics used for processor scheduling of threads based on user program input. According to the present invention, a user program can specify thread affinity, locality, and load balancing scheduling parameters at thread creation time and can change these parameters dynamically during thread execution. In addition, other entities with sufficient information and privileges may also specify such scheduling parameters.

The preferred embodiment implements two types of schedule queues for the execution of user program threads: a central schedule queue and one or more per-processor local schedule queues, although other queue structures could be used. The preferred embodiment uses the notion that, when
the user program has sufficient information on the data access patterns of threads, the scheduling and context switching of these threads through both the central schedule queue and the per-processor local schedule queues can be made to be faster than prior art systems.

The present invention provides the following facilities:

Threads can be specified to have affinity to each other, so that threads with the same affinity-ids are scheduled back-to-back in the same processor. Threads with the same affinity-ids operate on the same data and thus if one thread has already caused the data from memory to appear on the cache or the memory closest to the processor, the other thread can execute on the same processor to reuse this cached data to reduce cache misses and improve performance.

Threads can be specified to be local to a specified processor, or a set of available processors, or a subset of available processors. Threads with a locality of a specified processor are scheduled using the local schedule queue of the processor and have a low context switch time because cheaper locks are involved. Threads with a locality of -1 can be executed on any available processor, to enhance load balancing.

The affinity and locality of threads can be dynamically changed. A thread with certain locality and affinity characteristics can have its locality and affinity characteristics changed at any point of its execution based on execution or data access patterns.

Environment

FIG. 1 is a block diagram that illustrates an exemplary hardware environment implemented according to the preferred embodiment of the present invention. In the exemplary hardware environment, a computer system 8 is typically a symmetric multi-processor (SMP) architecture and is comprised of a plurality of processors 10 (each of which has a cache 12), shared random access memory (RAM) 14, and other components, such as peripheral interfaces 16, controllers 18, etc. The computer system 8 operates under the control of an operating system 20, which in turn controls the execution of one or more user program threads (UTs) 22 on the various processors 10. Of course, those skilled in the art will recognize that the exemplary environment illustrated in FIG. 1 is not intended to limit the present invention. Indeed, those skilled in the art will recognize that other alternative hardware environments may be used without departing from the scope of the present invention.

FIG. 2 is a block diagram that further illustrates the exemplary software environment implemented according to the preferred embodiment of the present invention. In the exemplary software environment, the operating system 20 includes a scheduler 24, a central schedule queue 26, one or more per-processor local scheduler queues 28, and a kernel-level thread library 30. The threads 22 result from the execution of a user program 32 in conjunction with a user program-level thread library 34 that provides support for multi-threaded operations.

FIG. 3 is a block diagram that illustrates an exemplary central schedule queue 26 and per-processor local schedule queues 28 implemented according to the preferred embodiment of the present invention. Threads 22 resident in the central schedule queue 26 may be scheduled for execution on any available processor 10 by the scheduler 24, while threads 22 resident in the per-processor local schedule queues 28 may be scheduled for execution only on the specified processor 10 (when available) by the scheduler 24. Those skilled in the art, however, will recognize that alternative embodiments could use different types and numbers of queues, e.g., a hierarchical set of queues, where each queue corresponds to a subset of the set of all processors and which define the migration domain of the threads belonging to the queue, without departing from the scope of the present invention.

FIG. 4 is a block diagram that illustrates the grouping of threads 22 by affinity-ids 36 in a schedule queue 26 or 28. In the exemplary structure, threads 22 with the same affinity-ids 36 are scheduled for execution back-to-back in the same processor 10, so that they can operate on the same data.

The operating system 20, threads 22, scheduler 24, scheduler queues 26 and 28, kernel-level thread library 30, user program 32, and user-program-level thread library 34 are each comprised of instructions, data structures, and/or data which, when read, interpreted, and/or executed by the processors 10, cause the processors 10 to perform the steps necessary to implement and/or use the preferred embodiment of the present invention, as described in more detail below. Generally, the operating system 20, threads 22, scheduler 24, scheduler queues 26 and 28, kernel-level thread library 30, user program 32, and user-program-level thread library 34 are embodied in and/or readable from a device, carrier, or media, such as memory, data storage devices, and/or remote devices connected to the computer system 8 via one or more data communications devices.

Thus, the present invention may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term "article of manufacture" (or alternatively, "carrier or product") as used herein is intended to encompass logic, data structures, and/or data accessible from any device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope of the present invention.

Those skilled in the art will recognize that the exemplary environments and structures illustrated in FIGS. 1, 2, 3, and 4 are not intended to limit the present invention. Indeed, those skilled in the art will recognize that other alternative environments and structures may be used without departing from the scope of the present invention.

Functions of the User Program-Level Thread Library

The user program-level thread library 34 provides a number of different functions that permit threads 22 to be created and deleted in large numbers during the lifetime of a user program 32. When a user program 32 requests the creation of a thread 22 through the user program-level library 34, it specifies the function that this thread 22 executes, the argument to be passed to this function, the stack information for the thread 22, and any other optimization arguments related to affinity, locality, and load-balancing.

The stack information and the optimization arguments are optional and the library 34 assumes a default value if not specified. For example, the user program 32 may provide its own stack pointer, or specify that the thread 22 uses a stack of some specific size, or relegate the decision to the library 34. Creating a thread 22 involves allocating a stack for the thread 22, initializing its context data-structures, and adding the thread 22 to the appropriate scheduling queue 26 or 28.
The thread 22 remains in the schedule queue 26 or 28 until it is dispatched for execution by the scheduler 24.

[illegible] (INT ARGC, CHAR** ARGV) in the user-program-level thread library 34, which is a user program-specified function. This function specifies the main procedure of the thread [illegible] function MAIN(INT ARGC, CHAR** ARGV) specifies the main procedure in a sequential environment.

A function NUM_PROCESSOR() in the library 34 returns the number of processors 10 available in the system 8.

The thread 22 can identify the processor 10 on which it is executing by invoking the function THIS_PROCESSOR() in the library 34. This function returns a value from 0 to NUM_PROCESSORS()-1. In one embodiment, the number of processors 10 that may be used by the thread 22 is set or specified at invocation time, but alternative embodiments may make it dynamic, i.e., so that processors 10 can be added and deleted dynamically.

Context switching between threads 22 can occur when a thread 22 yields, or blocks, or exits. When a thread 22 yields by invoking the function THREAD_YIELD() in the library 34, it gets pushed into a schedule queue 26 or 28 and it switches context with another thread 22 from the schedule queue 26 or 28.

When a thread 22 blocks by invoking the function THREAD_BLOCK() in the library 34, it switches context with another thread 22 from the schedule queue 26 or 28. The thread 22 is unblocked by another thread 22 that invokes the function THREAD_UNBLOCK(THREAD_T* T) in the library 34, which requires a pointer of type THREAD_T*. This causes the thread 22 to be added to the schedule queue 26 or 28. The difference between a blocked thread 22 and a yielding thread 22 is that a blocked thread 22 gives up the processor 10, but does not go back to the schedule queue 26 or 28 and has to be explicitly unblocked to be put back into the schedule queue 26 or 28, while a yielding thread 22 is automatically re-scheduled when it relinquishes the processor 10.

A thread 22 exits or terminates by calling the function THREAD_EXIT(VOID* RET_VAL) in the library 34, wherein the variable RET_VAL is the value returned by the thread 22. When the thread 22 exits or terminates, the scheduler 24 switches its context with another thread 22 from the schedule queue 26 or 28.

A thread 22 can wait for another thread 22 to finish executing by calling the function THREAD_JOIN(THREAD_T* T, VOID** RET_VAL_P) in the library 34. The allocated resources for the thread 22 are cleaned up after it has terminated and after another thread 22 has joined it. For the sake of correctness and efficiency, the user program-level thread library 34 allows a thread 22 TA to "join" TB only if TA is the creator of TB.

A thread 22 can identify itself by invoking the function THIS_THREAD() in the library 34, which returns a pointer of type THREAD_T*.

The total number of threads 22 on the system 8 at any point of time can be obtained by calling the function NUM_THREADS() in the library 34. Threads 22 are also identified by logical numbers from 0 to NUM_THREADS(). The function THREAD_ID(THREAD_T*) in the library 34 returns a pointer to a thread 22 and the function THIS_THREAD_ID() returns the pointer corresponding to the [illegible] thread-specific output information, such as performance analysis and visualization.

A thread 22 itself can change its own migration domain by [illegible] MIGRATE(UNSIGNED LONG DOMAIN_BITS) in the library 34. If the thread 22 can execute on a processor 10 whose [illegible], then the pth bit from the right in the [illegible] is set to 1. However, when a thread 22 is holding a mutex it is not permitted to migrate to another processor 10 until it releases the mutex.

Load Balancing

In the preferred embodiment of the present invention, a thread 22 can be: (a) non-sticky (schedulable on any of the available processors 10); (b) part-sticky (schedulable on a subset of the available processors 10); or (c) sticky (schedulable only on a specified processor 10 and not migratable at all). The stickiness of a thread 22 can be changed dynamically by the thread 22 itself, or any other process or entity with sufficient rights and information, using the user program-level thread library 34 to alter the locality and load balancing characteristics of the thread 22.

Non-sticky and part-sticky threads 22 are scheduled using the central schedule queue 26, which ensures load balancing. The use of the central schedule queue 26 implies that a thread 22 can migrate among processors 10.

However, non-sticky and part-sticky threads 22 are more expensive to schedule and context-switch than sticky threads 22 that are scheduled using the per-processor local schedule queues 28. The reason is that the data structures used to manage the central schedule queue 26 involve shared memory locks to avoid race conditions in a multi-processing context.

In contrast, the per-processor local schedule queues 28 do not need any locks to update their data structures. Thus, sticky threads 22 have a lower context-switch time than migratable threads 22.

A thread 22 that has sufficient knowledge about its execution or data access patterns can control its migration. On the other hand, by making a thread 22 sticky, better cache utilization can be achieved. Also, if the user program 32 does not have an inherent load imbalance, significant advantages can be achieved by distributing the threads 22 so that each executes on a specific processor 10.

Thread Affinity

The concept of locality can be extended to the use of affinity-ids 36 for threads 22. One or more threads 22 can have a specified affinity-id 36, so that threads 22 with the specified affinity-id 36 are executed back-to-back in a processor 10. Appropriate scheduling is done by clustering the threads 22 with the same affinity-id 36 together in a per-processor local schedule queue 28. Thus, when a thread 22 blocks, the next thread 22 to be executed is a thread 22 with the same affinity-id 36. If there is no such thread 22, then another thread 22 with a different affinity-id 36 is executed. If there is no eligible thread 22 in the per-processor local schedule queue 28, then a thread 22 from the central schedule queue 26 is dispatched for execution. When a thread 22 yields, it goes back into its cluster of threads 22 with the same affinity-id 36 on the appropriate per-processor local schedule queue 28.

State Diagram of a Thread

FIG. 5 is a state diagram that illustrates the different states of a thread 22 in the computer system 8 according to the
preferred embodiment of the present invention. Of course, 
those skilled in the art will recognize that other logic could 
be used without departing from the scope of the present 
invention. 

A thread 22 is first initialized at state 38, where it may 
specify its affinity, locality, and load balancing characteris- 
tics. From state 38, a transition is made to state 40, where the 
scheduler 24 places the thread 22 in a schedule queue 26 or 
28, according to its affinity, locality, and load balancing 
characteristics. From state 40, threads 22 may transition to 
state 42 to wait for resources or state 44 to execute. From 
state 42, the thread transitions back to state 40. From state 
44, the thread 22 transitions back to state 40 when it yields 
or is preempted, to state 46 when it is blocked, or to state 48 
when it exits or terminates. During these state transitions, 
the thread 22 may alter its affinity, locality, or load balancing 
characteristics to affect the operation of the scheduler 24 at 
state 40. 
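
The transitions described above can be sketched as a small lookup table. This is only an illustration, not the patented implementation: the state names, the event names, and the helpers `step` and `run` are hypothetical, and only the reference numerals 38 through 48 come from the text. The transition from state 46 back to state 40 is an assumption drawn from the library's description of unblocking, since this paragraph does not state it.

```python
from enum import Enum


class State(Enum):
    # Values reuse the reference numerals of FIG. 5; the names are illustrative.
    INIT = 38
    READY = 40
    WAITING = 42
    RUNNING = 44
    BLOCKED = 46
    EXITED = 48


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.INIT, "enqueue"): State.READY,
    (State.READY, "wait"): State.WAITING,
    (State.READY, "dispatch"): State.RUNNING,
    (State.WAITING, "resume"): State.READY,
    (State.RUNNING, "yield"): State.READY,
    (State.RUNNING, "preempt"): State.READY,
    (State.RUNNING, "block"): State.BLOCKED,
    (State.RUNNING, "exit"): State.EXITED,
    # Assumption: an unblocked thread re-enters the schedule queue (state 40).
    (State.BLOCKED, "unblock"): State.READY,
}


def step(state, event):
    """Return the next state, or raise ValueError for an undefined transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, state=State.INIT):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in a table makes it easy to see that a blocked thread cannot be dispatched directly; it must first return to the ready state.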

Logic of the Scheduler 

FIG. 6 is a flowchart that illustrates exemplary logic 
performed by the scheduler 24 during the ready state 40 
according to the preferred embodiment of the present inven- 
tion. Of course, those skilled in the art will recognize that 
other logic could be used without departing from the scope 
of the present invention. 

Block 50 represents the scheduler 24 waiting for the next 
event to occur. When an event does occur, such as an I/O event, 
etc., the logic of Blocks 52-68 is performed. Block 52 is a 
decision block that represents the scheduler 24 determining 
whether the event was a notification that a thread 22 requires 
scheduling (e.g., another thread 22 is being de-scheduled). If 
not, control transfers to Block 54, which represents the 
scheduler 24 performing other processing and then to Block 
50; otherwise, control transfers to Block 56. 

Block 56 represents the scheduler 24 searching for a 
group of threads 22 in one of the schedule queues 26 or 28 
having the same affinity-id 36 as that specified for the 
thread 22 being de-scheduled. 

Block 58 is a decision block that represents the scheduler 
24 determining whether a thread 22 with the same affinity-id 
36 was found. If so, control transfers to Block 60; otherwise, 
control transfers to Block 62. 

Block 60 represents the scheduler 24 scheduling the 
thread 22 with the same affinity-id 36 for execution on the 
processor 10. Thereafter, control transfers back to Block 50. 

Block 62 represents the scheduler 24 searching for a 
thread 22 in one of the schedule queues 26 or 28 having 
the same locality as that specified for the thread 22 being 
de-scheduled. 

Block 64 is a decision block that represents the scheduler 
24 determining whether a thread 22 with the same locality was 
found. If so, control transfers to Block 66; otherwise, control 
transfers to Block 68. 

Block 66 represents the scheduler 24 scheduling the 
thread 22 with the same locality for execution on the 
processor 10. Thereafter, control transfers back to Block 50. 

Block 68 represents the scheduler 24 scheduling some 
other eligible thread 22 from the central queue 26 for 
execution on the processor 10. Thereafter, control transfers 
back to Block 50. 
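
By way of illustration only, the selection logic of Blocks 56-68 could be sketched in Python as follows. This is a minimal sketch and not the implementation of the preferred embodiment: the names `Thread`, `take_first`, and `pick_next` are hypothetical, the encoding of locality as a processor number is an assumption, and so is the choice to search the local queue before the central queue (the description says only that one of the schedule queues 26 or 28 is searched). Skipping the affinity search when the de-scheduled thread has no affinity-id is likewise an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class Thread:
    """Hypothetical stand-in for a thread 22 and its scheduling characteristics."""
    name: str
    affinity_id: Optional[int]  # stand-in for affinity-id 36; None means no affinity
    locality: Optional[int]     # assumed encoding: processor number, None means any


def take_first(queue: Deque[Thread],
               wanted: Callable[[Thread], bool]) -> Optional[Thread]:
    """Remove and return the first queued thread satisfying `wanted`, or None."""
    for i, candidate in enumerate(queue):
        if wanted(candidate):
            del queue[i]
            return candidate
    return None


def pick_next(descheduled: Thread,
              local_queue: Deque[Thread],
              central_queue: Deque[Thread]) -> Optional[Thread]:
    """Choose the next thread to run after `descheduled` leaves the processor."""
    queues = (local_queue, central_queue)  # search order is an assumption

    # Blocks 56-60: prefer a thread with the same affinity-id.
    if descheduled.affinity_id is not None:
        for q in queues:
            t = take_first(q, lambda c: c.affinity_id == descheduled.affinity_id)
            if t is not None:
                return t

    # Blocks 62-66: otherwise prefer a thread with the same locality.
    for q in queues:
        t = take_first(q, lambda c: c.locality == descheduled.locality)
        if t is not None:
            return t

    # Block 68: otherwise take some other eligible thread from the central queue.
    return central_queue.popleft() if central_queue else None
```

In this sketch a return value of None corresponds to the case where no eligible thread is queued, after which the scheduler would simply resume waiting at Block 50.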

Conclusion 

This concludes the description of the preferred embodiment 
of the invention. The following describes some alternative 
embodiments for accomplishing the present invention. For 
example, any type of computer, such as a mainframe, 
minicomputer, or personal computer, could be used with the 
present invention. In addition, any software program 
adhering (either partially or entirely) to the technique of 
multi-threading could benefit from the present invention. 

In summary, the present invention discloses a method, 
apparatus, and computer program carrier for dynamically 
exploiting affinity, locality, and load balancing for 
scheduling execution of multi-threaded user programs in a 
multi-processor computer system. Affinity, locality, and 
load balancing characteristics are specified at thread 
creation time and can be dynamically changed during thread 
execution, either by the user program itself or by any other 
process or entity with sufficient privileges and 
information. A central schedule queue and one or more 
per-processor local schedule queues are used to schedule the 
threads based on the affinity, locality, and load balancing 
characteristics. 
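
As a hedged illustration of the summary above, the following Python sketch shows characteristics being specified at thread creation, changed during execution, and used to select between the central schedule queue and a per-processor local schedule queue. All names (`SchedAttrs`, `UserThread`, `ScheduleQueues`) are hypothetical and are not the interface of the thread library described herein. Representing locality as a set of processor numbers is an assumption, as is the placement of part-sticky threads on the central queue; only the use of a local queue for a thread bound to a specified processor follows from the claims.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass
class SchedAttrs:
    """Hypothetical bundle of scheduling characteristics for one thread."""
    affinity_id: Optional[int] = None
    # Assumed encoding: None = non-sticky (any processor), one element =
    # sticky (a specified processor), several elements = part-sticky.
    processors: Optional[FrozenSet[int]] = None


@dataclass
class UserThread:
    name: str
    attrs: SchedAttrs = field(default_factory=SchedAttrs)


class ScheduleQueues:
    """One central schedule queue plus one local queue per processor."""

    def __init__(self, n_processors: int) -> None:
        self.central = deque()
        self.local = {p: deque() for p in range(n_processors)}

    def enqueue(self, thread: UserThread) -> None:
        procs = thread.attrs.processors
        if procs is not None and len(procs) == 1:
            # Sticky: queue on the local queue of the specified processor.
            (p,) = procs
            self.local[p].append(thread)
        else:
            # Non-sticky or part-sticky: central queue (an assumption for
            # the part-sticky case).
            self.central.append(thread)


# Characteristics specified at creation time: non-sticky, affinity-id 5.
queues = ScheduleQueues(n_processors=2)
worker = UserThread("worker", SchedAttrs(affinity_id=5))
queues.enqueue(worker)

# Later, the thread (or a privileged entity) makes it sticky to processor 1;
# the next time it becomes ready it is queued locally instead.
queues.central.remove(worker)
worker.attrs.processors = frozenset({1})
queues.enqueue(worker)
```

Because the characteristics are read each time the thread is queued, a change made during execution takes effect at the thread's next scheduling decision in this sketch.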

The foregoing description of the preferred embodiment of 
the invention has been presented for the purposes of 
illustration and description. It is not intended to be 
exhaustive or to limit the invention to the precise form 
disclosed. Many modifications and variations are possible in 
light of the above teaching. It is intended that the scope 
of the invention be limited not by this detailed 
description, but rather by the claims appended hereto. 
What is claimed is: 

1. A method of scheduling thread execution in a computer, 
comprising the steps of: 

(a) creating a thread in a memory of a computer; 

(b) specifying one or more scheduling characteristics of 
the thread; 

(c) scheduling execution of the thread in the computer in 
accordance with the specified scheduling characteristics; 
and 

(d) modifying one or more of the scheduling character- 
istics of the thread during the execution of the thread. 
2. The method of claim 1, wherein the specified scheduling 
characteristics are selected from a group comprising a 
locality characteristic, an affinity characteristic, and a 
load balancing characteristic. 

3. The method of claim 2, wherein the locality 
characteristic indicates that the thread is executed by a 
specified processor. 

4. The method of claim 3, further comprising the step of 
scheduling the thread for execution using a local schedule 
queue of the specified processor. 

5. The method of claim 2, wherein the locality 
characteristic indicates that the thread is local to a 
subset of available processors. 

6. The method of claim 2, wherein the locality character- 
istic indicates that the thread can execute on any processor. 
7. The method of claim 2, wherein the affinity 
characteristic indicates that the thread can reuse data from 
a prior thread. 

8. The method of claim 2, wherein the affinity character- 
istic indicates that the thread is affine to another thread. 
9. The method of claim 8, further comprising the step of 
scheduling the affine threads for execution back-to-back in 
a processor when the affine threads have a same locality 
characteristic. 

10. The method of claim 8, wherein the affine threads 
operate on shared data. 

11. The method of claim 8, further comprising the step of 
scheduling a second thread for execution in a processor after 
a first thread has completed its execution, when the first and 
second threads are affine threads, so that the second thread 
can reuse cached data in the processor. 

12. The method of claim 1, wherein the specifying step 
comprises the step of specifying one or more of the specified 
scheduling characteristics of the thread during execution of 
the thread. 

13. The method of claim 12, wherein the scheduling 
characteristics are specified by the thread itself. 

14. The method of claim 1, wherein the specifying step 
comprises the step of specifying one or more of the sched- 
uling characteristics of the thread when the thread is created. 

15. The method of claim 14, wherein the scheduling 
characteristics are specified by the thread itself. 

16. The method of claim 1, wherein the scheduling 
characteristics are modified by the thread itself. 

17. The method of claim 1, wherein the scheduling 
characteristics are modified by an entity other than the thread. 

18. The method of claim 1, wherein the modifying step 
comprises the step of dynamically modifying one or more of 
the scheduling characteristics of the thread during execution 
of the thread based on an operation of the thread. 

19. The method of claim 1, wherein the modifying step 
comprises the step of dynamically modifying one or more of 
the scheduling characteristics of the thread during execution 
of the thread based on a data access pattern of the thread. 

20. The method of claim 1, wherein the specifying step 
comprises the step of specifying one or more scheduling 
characteristics of the thread via a user-level thread library. 

21. The method of claim 20, wherein the user-level thread 
library provides facilities to specify the affinity, locality, and 
load balancing characteristics used in scheduling the thread 
for execution. 

22. The method of claim 1, wherein the scheduling step 
comprises the step of scheduling execution of the thread in 
the computer in accordance with the specified scheduling 
characteristics using a central schedule queue and a per- 
processor local schedule queue. 

23. The method of claim 1, wherein the thread is non- 
sticky and thus is schedulable on any available processor. 

24. The method of claim 1, wherein the thread is part- 
sticky and thus is schedulable on a set or subset of available 
processors. 

25. The method of claim 1, wherein the thread is sticky 
and thus is executed only on a specified processor. 

26. A multi-threaded computer system, comprising: 

(a) one or more processors; 

(b) means, performed by one of the processors, for 
creating a thread in a memory of the computer system; 

(c) means, performed by one of the processors, for 
specifying one or more scheduling characteristics of the 
thread; 

(d) means, performed by one of the processors, for 
scheduling execution of the thread in the computer in 
accordance with the specified scheduling characteris- 
tics; and 

(e) means, performed by one of the processors, for 
modifying one or more of the scheduling characteristics 
of the thread during the execution of the thread. 

27. The system of claim 26, wherein the specified sched- 
uling characteristics are selected from a group comprising a 
locality characteristic, an affinity characteristic, and a load 
balancing characteristic. 

28. The system of claim 27, wherein the locality 
characteristic indicates that the thread is executed by a 
specified processor. 

29. The system of claim 28, further comprising means for 
scheduling the thread for execution using a local schedule 
queue of the specified processor. 

30. The system of claim 27, wherein the locality 
characteristic indicates that the thread is local to a 
subset of available processors. 

31. The system of claim 27, wherein the locality 
characteristic indicates that the thread can execute on any 
processor. 

32. The system of claim 27, wherein the affinity charac- 
teristic indicates that the thread can reuse data from a prior 
thread. 

33. The system of claim 27, wherein the affinity charac- 
teristic indicates that the thread is affine to another thread. 

34. The system of claim 33, further comprising means for 
scheduling the affine threads for execution back-to-back in 
a processor when the affine threads have a same locality 
characteristic. 

35. The system of claim 33, wherein the affine threads 
operate on shared data. 

36. The system of claim 33, further comprising means for 
scheduling a second thread for execution in a processor after 
a first thread has completed its execution, when the first and 
second threads are affine threads, so that the second thread 
can reuse cached data in the processor. 

37. The system of claim 26, wherein the means for 
specifying comprises means for specifying one or more of 
the specified scheduling characteristics of the thread during 
execution of the thread. 

38. The system of claim 37, wherein the scheduling 
characteristics are specified by the thread itself. 

39. The system of claim 26, wherein the means for 
specifying comprises means for specifying one or more of 
the scheduling characteristics of the thread when the thread 
is created. 

40. The system of claim 39, wherein the scheduling 
characteristics are specified by the thread itself. 

41. The system of claim 26, wherein the scheduling 
characteristics are modified by the thread itself. 

42. The system of claim 26, wherein the scheduling 
characteristics are modified by an entity other than the 
thread. 

43. The system of claim 26, wherein the means for 
modifying comprises means for dynamically modifying one 
or more of the scheduling characteristics of the thread during 
execution of the thread based on an operation of the thread. 

44. The system of claim 26, wherein the means for 
modifying comprises means for dynamically modifying one 
or more of the scheduling characteristics of the thread during 
execution of the thread based on a data access pattern of the 
thread. 

45. The system of claim 26, wherein the means for 
specifying comprises means for specifying one or more 
scheduling characteristics of the thread via a user-level 
thread library. 

46. The system of claim 45, wherein the user-level thread 
library provides facilities to specify the affinity, locality, and 
load balancing characteristics used in scheduling the thread 
for execution. 

47. The system of claim 26, wherein the means for 
scheduling comprises means for scheduling execution of the 
thread in the computer in accordance with the specified 
scheduling characteristics using a central schedule queue 
and a per-processor local schedule queue. 

48. The system of claim 26, wherein the thread is non- 
sticky and thus is schedulable on any available processor. 

49. The system of claim 26, wherein the thread is 
part-sticky and thus is schedulable on a set or subset of 
available processors. 

50. The system of claim 26, wherein the thread is sticky 
and thus is executed only on a specified processor. 

51. A carrier embodying logic for scheduling thread 
execution in one or more processors, the logic comprising 
the steps of: 

(a) creating a thread in a memory of the processor; 

(b) specifying one or more scheduling characteristics of 
the thread; 

(c) scheduling execution of the thread in one or more of 
the processors in accordance with the specified sched- 
uling characteristics; and 

(d) modifying one or more of the scheduling character- 
istics of the thread during the execution of the thread. 

52. The method of claim 51, wherein the specified sched- 
uling characteristics are selected from a group comprising a 
locality characteristic, an affinity characteristic, and a load 
balancing characteristic. 

53. The method of claim 52, wherein the locality char- 
acteristic indicates that the thread is executed by a specified 
processor. 

54. The method of claim 53, further comprising the step 
of scheduling the thread for execution using a local schedule 
queue of the specified processor. 

55. The method of claim 52, wherein the locality char- 
acteristic indicates that the thread is local to a subset of 
available processors. 

56. The method of claim 52, wherein the locality 
characteristic indicates that the thread can execute on any 
processor. 

57. The method of claim 52, wherein the affinity charac- 
teristic indicates that the thread can reuse data from a prior 
thread. 

58. The method of claim 52, wherein the affinity charac- 
teristic indicates that the thread is affine to another thread. 

59. The method of claim 58, further comprising the step 
of scheduling the affine threads for execution back-to-back 
in a processor when the affine threads have a same locality 
characteristic. 

60. The method of claim 58, wherein the affine threads 
operate on shared data. 

61. The method of claim 58, further comprising the step 
of scheduling a second thread for execution in a processor 
after a first thread has completed its execution, when the first 
and second threads are affine threads, so that the second 
thread can reuse cached data in the processor. 

62. The method of claim 51, wherein the specifying step 
comprises the step of specifying one or more of the specified 
scheduling characteristics of the thread during execution of 
the thread. 

63. The method of claim 62, wherein the scheduling 
characteristics are specified by the thread itself. 

64. The method of claim 51, wherein the specifying step 
comprises the step of specifying one or more of the sched- 
uling characteristics of the thread when the thread is created. 

65. The method of claim 64, wherein the scheduling 
characteristics are specified by the thread itself. 

66. The method of claim 51, wherein the scheduling 
characteristics are modified by the thread itself. 

67. The method of claim 51, wherein the scheduling 
characteristics are modified by an entity other than the 
thread. 

68. The method of claim 51, wherein the modifying step 
comprises the step of dynamically modifying one or more of 
the scheduling characteristics of the thread during execution 
of the thread based on an operation of the thread. 

69. The method of claim 51, wherein the modifying step 
comprises the step of dynamically modifying one or more of 
the scheduling characteristics of the thread during execution 
of the thread based on a data access pattern of the thread. 

70. The method of claim 51, wherein the specifying step 
comprises the step of specifying one or more scheduling 
characteristics of the thread via a user-level thread library. 

71. The method of claim 70, wherein the user-level thread 
library provides facilities to specify the affinity, locality, and 
load balancing characteristics used in scheduling the thread 
for execution. 

72. The method of claim 51, wherein the scheduling step 
comprises the step of scheduling execution of the thread in 
the computer in accordance with the specified scheduling 
characteristics using a central schedule queue and a per- 
processor local schedule queue. 

73. The method of claim 51, wherein the thread is 
non-sticky and thus is schedulable on any available proces- 
sor. 

74. The method of claim 51, wherein the thread is 
part-sticky and thus is schedulable on a set or subset of 
available processors. 

75. The method of claim 51, wherein the thread is sticky 
and thus is executed only on a specified processor. 